Abstract

The use of variable selection methods is particularly appealing in statistical problems with functional data. The obvious general criterion for variable selection is to choose the ‘most representative’ or ‘most relevant’ variables. However, it is also clear that a purely relevance-oriented criterion could lead to selecting many redundant variables. The mRMR (minimum Redundancy Maximum Relevance) procedure, proposed by Ding and Peng (2005) and Peng et al. (2005), is an algorithm to systematically perform variable selection, achieving a reasonable trade-off between relevance and redundancy. In its original form, this procedure is based on the use of the so-called mutual information criterion to assess relevance and redundancy. Keeping the focus on functional data problems, we propose here a modified version of the mRMR method, obtained by replacing the mutual information by the new association measure (called distance correlation) suggested by Székely et al. (2007). We have also performed an extensive simulation study, including 1600 functional experiments (100 functional models x 4 sample sizes x 4 classifiers) and three real-data examples aimed at comparing the different versions of the mRMR methodology. The results are quite conclusive in favor of the new proposed alternative.

Keywords: functional data analysis ; supervised classification ; distance 
correlation ; variable selection 

1 Introduction 

The use of high-dimensional or functional data entails some important practical issues. Besides the problems associated with computation time and storage costs, high-dimensionality introduces noise and redundancy. Thus, there is a strong case for using different techniques of dimensionality reduction.

We will consider here dimensionality reduction via variable selection techniques. The general aim of these techniques is to replace the original high-dimensional (perhaps functional) data by lower dimensional projections obtained by just selecting a small subset of the original variables in each observation. In the case of functional data, this amounts to replacing each observation $\{x(t), t \in [0,1]\}$ with a low-dimensional vector $(x(t_1), \ldots, x(t_k))$. Then, the chosen statistical methodology (supervised classification, clustering, regression, ...) is performed with the ‘reduced’, low-dimensional data. Usually the values $t_1, \ldots, t_k$ identifying the selected variables are the same for all considered data. A first advantage of variable selection (when compared with other dimension reduction methods, such as Partial Least Squares) is the ease of interpretability, since the dimension reduction is made in terms of the original variables. In a way, variable selection appears as the most natural dimension reduction procedure in order to keep in touch, as much as possible, with the original data: see for instance Golub et al. (1999); Lindquist and McKeague (2009), among many other examples in experimental sciences or engineering. In Golub et al. (1999) the authors note that 50 genes (among almost 7000) are enough for cancer subtype classification. Likewise, Lindquist and McKeague (2009) point out that in some functional data regression (or classification) problems, such as functional magnetic resonance imaging or gene expression, ‘the influence is concentrated at sensitive time points’.

We refer to Guyon et al. (2006) for an account of different variable selection methods in the multivariate (non-functional) case. A partial comparative study, together with some new proposals for the functional framework, can be found in Berrendero et al. (2014).


Throughout this work we will consider variable selection in the setting of functional supervised classification (the extension to more general regression problems is also possible with some obvious changes). Thus, the available sample information is a data set of type $\mathcal{D}_n = ((X_1, Y_1), \ldots, (X_n, Y_n))$ of $n$ independent observations drawn from a random pair $(X, Y)$. Here $Y$ denotes a binary random variable, with values in $\{0,1\}$, indicating the membership to one of the populations $P_0$ or $P_1$, and the $X_i$ are iid trajectories (in the space $C[0,1]$ of real continuous functions on $[0,1]$), drawn from a stochastic process $X = X(t)$. The supervised classification problem aims at predicting the membership class $Y$ of a new observation for which only the variable $X$ is known. Any function $g_n(x) = g_n(x; \mathcal{D}_n)$ with values in $\{0,1\}$ is called a classifier.


Several functional classifiers have been considered in the literature; see, e.g., Baíllo et al. (2011b) for a survey. Among them maybe the simplest one is the so-called k-nearest neighbours (k-NN) rule, according to which an observation $x$ is assigned to $P_1$ if and only if the majority among its $k$ nearest sample observations $X_i$ in the training sample fulfil $Y_i = 1$. Here $k = k_n \in \mathbb{N}$ is a sequence of smoothing parameters which must satisfy $k_n \to \infty$ and $k_n/n \to 0$ in order to achieve consistency. In general, k-NN could be considered (from the limited experience so far available; see, e.g., Baíllo et al. (2011a)) a sort of benchmark, reference method for functional supervised classification. Simplicity, ease of motivation and general good performance (it typically does not lead to gross classification errors) are perhaps the most attractive features of this method. Besides k-NN, we have also considered (inspired by the paper of Ding and Peng (2005), where a similar study is carried out) three additional classifiers: the popular Fisher’s linear classifier (LDA) used often in classical discriminant analysis, the so-called Naive Bayes method (NB) and the (linear) Support Vector Machine classifier (SVM). Note that, in our empirical studies, all the mentioned classifiers (k-NN, LDA, NB and SVM) are used after the variable selection step, on the ‘reduced data’ resulting from the variable selection process.

In fact, as we will point out below, the main goal of our study is not to compare different classifiers. We are rather concerned with the comparison of different methods for variable selection (often referred to as feature selection). A relevant procedure for variable selection, especially popular in the machine learning community, is the so-called minimum Redundancy Maximum Relevance (mRMR) method. It was proposed by Ding and Peng (2005) and Peng et al. (2005) as a tool to select the most discriminant subset of variables in the context of some relevant bioinformatics problems. See also Battiti (1994); Kwak and Choi (2002); Yu and Liu (2004) for closely related ideas.


The purpose of this paper. Overall, we believe the mRMR procedure is a very natural way to tackle the variable selection problem if one wants to make completely explicit the trade-off relevance/redundancy. The method relies on the use of an association measure to assess the relevance and redundancy of the considered variables. In the original papers the so-called ‘mutual information’ measure was used for this purpose. The aim of the present paper is to propose other alternatives for the association measure, still keeping the main idea behind the mRMR procedure. In fact, most mRMR researchers admit that there is considerable room for improvement. We quote from the discussion in Peng et al. (2005): ‘The mRMR paradigm can be better viewed as a general framework to effectively select features and allow all possibilities for more sophisticated or more powerful implementation schemes’. In this vein, we consider several versions of the mRMR and compare them by an extensive empirical study. Two of these versions are new: they are based on the ‘distance covariance’ and ‘distance correlation’ association measures proposed by Székely et al. (2007). Our results suggest (and this is the main conclusion of our study) that the new version based on the distance correlation measure represents a clear improvement of the mRMR methodology.

The rest of the paper is organized as follows. Section 2 contains a brief summary and some remarks about the mRMR algorithm. The different association measures under study (which are used to define the different versions of the mRMR method) are explained in Section 3, with special attention to the correlation of distances; see Székely et al. (2007); Székely and Rizzo (2009). The empirical study, consisting of 1600 simulation experiments and some representative real data sets, is explained in Section 4. Finally, some conclusions are given.


2 The trade-off relevance/redundancy. The mRMR criterion 

When faced with the problem of variable selection in high-dimensional (or functional) data sets, a natural idea arises at once: obviously, one should select the variables according to their relevance (representativeness). However, at the same time, one should avoid the redundancy which appears when two highly relevant variables are closely associated with each other. In that case, one might expect that both variables essentially carry the same information, so that choosing just one of them should suffice.

The mRMR variable selection method, as proposed in Ding and Peng (2005); Peng et al. (2005), provides a formal implementation of a variable selection procedure which explicitly takes into account this trade-off relevance/redundancy.

In our functional binary classification problem, the description of the mRMR method is as follows: the functional explanatory variable $X(t)$, $t \in [0,1]$, will be used in a discretized version $(X(t_1), \ldots, X(t_N))$. When convenient, the notations $X_t$ and $X(t)$ will be used interchangeably. For any subset $S$ of $\{t_1, \ldots, t_N\}$, the relevance and the redundancy of $S$ are defined, respectively, by

$$\mathrm{Rel}(S) = \frac{1}{\mathrm{card}(S)} \sum_{t \in S} I(X_t, Y) \qquad (1)$$

and

$$\mathrm{Red}(S) = \frac{1}{\mathrm{card}^2(S)} \sum_{s,t \in S} I(X_t, X_s), \qquad (2)$$

where $\mathrm{card}(S)$ denotes the cardinality of $S$ and $I(\cdot, \cdot)$ is an ‘association measure’. The function $I$ measures how closely related two variables are. So, it is natural to think that the relevance of $X_t$ is measured by how closely related it is to the response variable $Y$, that is $I(X_t, Y)$, whereas the redundancy between $X_t$ and $X_s$ is given by $I(X_s, X_t)$. Now, in summary, the mRMR algorithm aims at maximizing the relevance while avoiding an excess of redundancy.

The choice of the association measure $I$ is a critical aspect in the mRMR methodology. In fact, this is the central point of the present work, so we will consider it in more detail later. For now, in order to explain how the mRMR method works, let us assume that the measure $I$ is given:

(a) The procedure starts by selecting the most relevant variable, given by the value $t_i$ such that the set $S_i = \{t_i\}$ maximizes $\mathrm{Rel}(S)$ among all the singleton sets of type $S_j = \{t_j\}$.

(b) Then, the variables are sequentially incorporated into the set $S$ of previously selected variables, with the criterion of maximizing the difference $\mathrm{Rel}(S) - \mathrm{Red}(S)$ (or alternatively the quotient $\mathrm{Rel}(S)/\mathrm{Red}(S)$).

(c) Finally, different stopping rules can be considered. We set the number of variables through a validation step (additional details can be found in Sections 3 and 5).
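Steps (a)-(c) can be sketched in code as follows. This is only a minimal illustration and not the authors' MATLAB implementation: the names `abs_corr` and `mrmr_select` are hypothetical, the absolute Pearson correlation merely stands in for a generic association measure $I$, the redundancy of a candidate is taken (as an assumption) to be its average association with the already selected variables, and the stopping rule (c) is reduced to a fixed number of variables.

```python
import numpy as np


def abs_corr(u, v):
    """Absolute Pearson correlation, used here as a stand-in for I."""
    return abs(np.corrcoef(u, v)[0, 1])


def mrmr_select(X, y, n_selected, assoc=abs_corr, criterion="difference"):
    """Greedy mRMR on an (n, N) matrix X of discretized curves.

    Returns the column indices of the selected variables, in order.
    """
    n_vars = X.shape[1]
    # Relevance I(X_t, Y) of every candidate variable.
    relevance = np.array([assoc(X[:, j], y) for j in range(n_vars)])
    # Step (a): start with the most relevant variable.
    selected = [int(np.argmax(relevance))]
    # Running sum of I(X_t, X_s) over the already selected s.
    red_sum = np.zeros(n_vars)
    # Step (b): add variables one at a time.
    while len(selected) < n_selected:
        last = selected[-1]
        red_sum += np.array(
            [assoc(X[:, j], X[:, last]) for j in range(n_vars)]
        )
        redundancy = red_sum / len(selected)
        if criterion == "difference":
            score = relevance - redundancy
        else:  # quotient criterion
            score = relevance / (redundancy + 1e-12)
        score[selected] = -np.inf  # never pick a variable twice
        selected.append(int(np.argmax(score)))
    return selected
```

With two nearly identical relevant columns and a third, weaker but less redundant one, a pure relevance ranking would keep the two duplicates, whereas this sketch is expected to pick one duplicate and then the third column.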

In practice, the use of the mRMR methodology is especially important in functional data problems, where variables which are very close together are often strongly associated.

Figure 1: Mean functions for both classes considered in the Tecator data set (first derivative). The left panel shows the five variables selected by Maximum Relevance. The right panel corresponds to the variables selected by mRMR.


The following example shows to what extent the mRMR makes a critical difference in the variable selection procedure. It concerns the well-known Tecator data set (a benchmark example very popular in the literature on functional data; see Section 5 for details). To be more specific, we use the first derivative of the curves in the Tecator data set, which is divided into two classes. We first use a simple ‘ranking procedure’, where the variables are sequentially selected according to their relevance (thus ignoring any notion of redundancy). The result is shown in the left panel of Figure 1 (the selected variables are marked with grey vertical lines). It can be seen that in this case, all the five selected variables provide essentially the same information. On the right panel we see the variables selected by the mRMR procedure, which are clearly better placed to provide useful information. This visual impression is confirmed by comparing the error percentages obtained from a supervised classification method using only the variables selected by each method. While the classification error obtained with the mRMR selected variables is 1.86%, the corresponding error obtained with those of the ranking method is 4.09%.

3 Association measures 

As indicated in the previous section, the mRMR criterion relies on the use of an association measure $I(X, Y)$ between random variables. The choice of appropriate association measures is a classical issue in mathematical statistics. Many different proposals are available and, in several aspects, this topic is still open for further research, especially in connection with the use of high-dimensional data sets (arising, e.g., in genetic microarray examples; see Reshef et al. (2011); Hall and Miller (2011)).

A complete review of the main association measures for random variables is clearly beyond the scope of this paper. So, we will limit ourselves to presenting here the measures $I(X, Y)$ we have used in this work:


1. The ordinary correlation coefficient between $X$ and $Y$ (in absolute value). This is the first obvious choice for the association measure $I(X, Y)$. It clearly presents some drawbacks (it does not characterize independence and it is unsuitable to capture non-linear association) but still, it does a good job in many practical situations.

2. The Mutual Information measure, $MI(X, Y)$, is defined by

$$MI(X, Y) = \int \log\left(\frac{p(x, y)}{p_1(x)\,p_2(y)}\right) p(x, y)\, d\mu(x, y), \qquad (3)$$

where $X$, $Y$ are two random variables with respective $\mu$-densities $p_1$ and $p_2$; in the standard, absolutely continuous case, $\mu$ would be the product Lebesgue measure. In the discrete case, $\mu$ would be a counting measure on a countable support. The joint density of $(X, Y)$ is denoted by $p(x, y)$.

This is the association measure used in the original version of the mRMR procedure; see Ding and Peng (2005); Peng et al. (2005).

It is clear that $MI(X, Y)$ measures how far $p(x, y)$ is from the independence situation $p(x, y) = p_1(x)p_2(y)$. It is easily seen that $MI(X, Y) = MI(Y, X)$ and $MI(X, Y) = 0$ if and only if $X$ and $Y$ are independent.

In practice, $MI(X, Y)$ must be approximated by considering, if necessary, ‘discretized versions’ of $X$ and $Y$, obtained by grouping their values on intervals represented by suitable label marks, $a_i$, $b_j$. This leads to approximate expressions of type

$$\widehat{MI}(X, Y) = \sum_{i,j} \log\left(\frac{P(X = a_i, Y = b_j)}{P(X = a_i)\,P(Y = b_j)}\right) P(X = a_i, Y = b_j), \qquad (4)$$

where, in turn, the probabilities can be empirically estimated by the corresponding relative frequencies. In Ding and Peng (2005) the authors suggest a threefold discretization pattern, i.e., the range of values of the variable is discretized in three classes. The limits of the discretization intervals are defined by the mean of the corresponding variable $\pm\sigma/2$ (where $\sigma$ is the standard deviation). We will explore this criterion in our empirical study below.

3. The Fisher-Correlation (FC) criterion: It is a combination of the F-statistic,

$$F(X, Y) = \frac{\sum_k n_k (\bar{X}_k - \bar{X})^2 / (K - 1)}{\sum_k (n_k - 1)\sigma_k^2 / (n - K)}, \qquad (5)$$

used in the relevance measure (1), and the ordinary correlation, C, used in the redundancy measure (2). In expression (5), $K$ denotes the number of classes (so $K = 2$ in our binary classification problem), $\bar{X}$ denotes the mean of $X$, $\bar{X}_k$ is the mean value of $X$ over the elements belonging to the $k$-th class, for $k = 0, 1$, and $n_k$ and $\sigma_k^2$ are the sample size and the variance of the $k$-th class, respectively.

Ding and Peng (2005) suggest that, in principle, this criterion might look more useful than MI when dealing with continuous variables, but their empirical results do not support that idea. Such results are confirmed by our study so that, in general terms, we conclude that the mutual information (4) is a better choice even in the continuous setting.


4. Distance covariance: this is an association measure recently proposed by Székely et al. (2007). Denote by $\varphi_{X,Y}$, $\varphi_X$, $\varphi_Y$ the characteristic functions of $(X, Y)$, $X$ and $Y$, respectively. Here $X$ and $Y$ denote multivariate random variables taking values in $\mathbb{R}^p$ and $\mathbb{R}^q$, respectively (note that the assumption $p = q$ is not needed). Let us suppose that the components of $X$ and $Y$ have finite first-order moments. The distance covariance between $X$ and $Y$ is the non-negative value $\mathcal{V}(X, Y)$ defined by

$$\mathcal{V}^2(X, Y) = \int_{\mathbb{R}^{p+q}} \left|\varphi_{X,Y}(u, v) - \varphi_X(u)\varphi_Y(v)\right|^2 w(u, v)\, du\, dv, \qquad (6)$$

with $w(u, v) = (c_p c_q |u|_p^{1+p} |v|_q^{1+q})^{-1}$, where $c_d = \pi^{(1+d)/2} / \Gamma((1+d)/2)$ is half the surface area of the unit sphere in $\mathbb{R}^{d+1}$ and $|\cdot|_d$ stands for the Euclidean norm in $\mathbb{R}^d$.

While definition (6) has a rather technical appearance, the resulting association measure has a number of interesting properties. Apart from the fact that (6) allows for the case where $X$ and $Y$ have different dimensions, we have $\mathcal{V}^2(X, Y) = 0$ if and only if $X$ and $Y$ are independent. Moreover, the indicated choice for the weights $w(u, v)$ provides valuable equivariance properties for $\mathcal{V}^2(X, Y)$, and the quantity can be consistently estimated from the mutual pairwise distances $|X_i - X_j|_p$ and $|Y_i - Y_j|_q$ between the sample values $X_i$ and $Y_j$ (no discretization is needed).

We refer to Székely et al. (2007); Székely and Rizzo (2009, 2012, 2013) for a detailed study of this increasingly popular association measure. We refer also to Berrendero et al. (2014) for an alternative use (not related to mRMR) of $\mathcal{V}^2(X, Y)$ in variable selection.


5. Distance correlation: this is just a sort of standardized version of the distance covariance. If we denote $\mathcal{V}^2(X) = \mathcal{V}^2(X, X)$, the (squared) distance correlation between $X$ and $Y$ is defined by $\mathcal{R}^2(X, Y) = \mathcal{V}^2(X, Y)/\sqrt{\mathcal{V}^2(X)\mathcal{V}^2(Y)}$ if $\mathcal{V}^2(X)\mathcal{V}^2(Y) > 0$, and $\mathcal{R}^2(X, Y) = 0$ otherwise.

Of course, other association measures might be considered. However, in order to keep the comparative study affordable, we have limited our study to the main association measures previously used in the mRMR literature. We have only added the new measures $\mathcal{V}^2$ and $\mathcal{R}^2$, which we have tested as possible improvements of the method.
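To make the two new measures concrete, the following sketch computes sample versions of the squared distance covariance and squared distance correlation from pairwise distances, using the double-centering estimate described in Székely et al. (2007). The function names are hypothetical and the code is an illustration under these assumptions, not the implementation used in the study.

```python
import numpy as np


def _centered_distances(z):
    """Pairwise Euclidean distances of the sample z, double-centered."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]  # treat a 1-d sample as n points in R^1
    d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
    row = d.mean(axis=1, keepdims=True)
    col = d.mean(axis=0, keepdims=True)
    return d - row - col + d.mean()


def distance_covariance_sq(x, y):
    """Sample squared distance covariance: mean of A_kl * B_kl."""
    a = _centered_distances(x)
    b = _centered_distances(y)
    return float((a * b).mean())


def distance_correlation_sq(x, y):
    """Sample squared distance correlation; 0 if a variance term vanishes."""
    vxy = distance_covariance_sq(x, y)
    vx = distance_covariance_sq(x, x)
    vy = distance_covariance_sq(y, y)
    denom = np.sqrt(vx * vy)
    return vxy / denom if denom > 0 else 0.0
```

For a sample on a symmetric grid with $y = x^2$, the Pearson correlation is numerically zero while the value returned by `distance_correlation_sq` stays clearly positive, which illustrates why the measure can capture non-linear association.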


Also, alternative versions of the mRMR procedure have been proposed in the literature. In particular, the Mutual Information measure could be estimated by kernel density estimation; see Wand and Jones (1995). Regarding the kernel-based estimation of the MI measure, the crucial issue (see Cao et al. (1994)) of the optimal selection of the smoothing parameter has not been, to our knowledge, explicitly addressed; note that here ‘optimal’ should refer to the estimation of MI. Likewise, other weighting factors might be used instead of just card(S) in equation (2); see Estévez et al. (2009). However, the ‘original’ version of mRMR (with discretization-based MI estimation) still seems to be the most popular standard; see Mandal and Mukhopadhyay (2014); Nguyen et al. (2014) for very recent examples.
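The discretization-based MI estimate of expression (4) can be sketched as follows, with the threefold pattern (cut points at the mean $\pm\sigma/2$) applied to a continuous variable. The function names are hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since the source does not specify it.

```python
import numpy as np


def three_level(x):
    """Discretize x into labels 0, 1, 2 with cut points mean -/+ sd/2."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return np.digitize(x, [m - s / 2, m + s / 2])


def mutual_information(a, b):
    """Plug-in MI estimate (natural log) for two discrete label vectors."""
    _, ai = np.unique(np.asarray(a), return_inverse=True)
    _, bi = np.unique(np.asarray(b), return_inverse=True)
    ai, bi = ai.ravel(), bi.ravel()
    # Joint relative frequencies P(X = a_i, Y = b_j).
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)
    joint /= joint.sum()
    # Marginals and the product P(X = a_i) P(Y = b_j).
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    prod = pa @ pb
    nz = joint > 0  # cells with zero frequency contribute nothing
    return float((joint[nz] * np.log(joint[nz] / prod[nz])).sum())
```

A relevance term would then be obtained as `mutual_information(three_level(x_t), y)` for a discretized curve value `x_t` and the class labels `y`.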


Let us finally note that all the association measures we are considering take positive values. So, the phenomena associated with the negative association values analyzed in Demler et al. (2013) do not apply in this case.

Notation. The association measures defined above will be denoted in the tables of our empirical study by C, MI, FC, V and R, respectively.


4 The simulation study 

We have checked five different versions of the mRMR variable selection methodology. They have been obtained by using different association measures (as indicated in the previous section) to assess relevance and redundancy.

In all cases, the comparisons have been made in the context of problems of binary supervised classification, using 100 different models to generate the data $(X, Y)$. These models are defined either by

(i) specifying the distributions of $X|Y = 0$ and $X|Y = 1$; in all cases, we take $p = P(Y = 0) = 1/2$, or

(ii) specifying both the marginal distribution of $X$ and the conditional distribution $\eta(x) = P(Y = 1|X = x)$.

Our experiments essentially consist of performing variable selection for each model using the different versions of mRMR and evaluating the results in terms of the respective probabilities of correct classification when different classifiers are used on the selected variables. The full list of considered models is available in the Supplemental material document. All these models have been chosen in such a way that the optimal (Bayes) classification rule depends on just a finite number of variables. The processes considered include Brownian motion (with different mean functions), Brownian bridge and several other Gaussian models, in particular the Ornstein-Uhlenbeck process. Other mixture models based on them are also considered. All these models are generated according to the pattern (i) above. In addition, we have considered several ‘logistic-type’ models, generated by using pattern (ii).
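As an illustration of pattern (i), the sketch below generates discretized trajectories from a hypothetical two-class model: standard Brownian motion for class 0 and Brownian motion plus a mean function for class 1, with class labels drawn with probability 1/2. The specific mean function $m(t) = t$, the grid and the function name are assumptions made for the example; the actual 100 models are listed only in the supplementary material.

```python
import numpy as np


def simulate_brownian_classes(n, n_grid=100, mean_shift=lambda t: t, seed=None):
    """Draw n labelled trajectories on an equi-spaced grid in (0, 1].

    Class 0: standard Brownian motion B(t).
    Class 1: B(t) + mean_shift(t)  (hypothetical choice of mean function).
    """
    rng = np.random.default_rng(seed)
    t = np.arange(1, n_grid + 1) / n_grid  # excludes t = 0, where B(0) = 0
    dt = np.diff(np.concatenate(([0.0], t)))
    # Brownian paths as cumulative sums of independent Gaussian increments.
    increments = rng.normal(size=(n, n_grid)) * np.sqrt(dt)
    paths = np.cumsum(increments, axis=1)
    y = rng.integers(0, 2, size=n)  # P(Y = 0) = 1/2
    X = paths + y[:, None] * mean_shift(t)
    return t, X, y
```

Each row of `X` is one discretized curve, so the output can be passed directly to a variable selection routine operating on an (n, N) data matrix.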

For each considered model all the variable selection methods (C, MI, etc.) are checked for four sample sizes, n = 30, 50, 100, 200, and four classification methods (k-NN, LDA, NB and SVM). So, we have in total 100 x 4 x 4 = 1600 simulation experiments.


4.1 Classification methods 


We have used the four classifiers considered in the paper by Ding and Peng (2005), except that we have replaced the logistic regression classifier (which is closely related to the standard linear classifier) with the non-parametric k-NN method. All of them are widely known and details can be found, e.g., in Hastie et al. (2005).


• Naive Bayes classifier (NB). This method relies on the assumption that the selected variables are Gaussian and conditionally independent in each class. So a new observation is assigned according to its posterior probability calculated from the Bayes rule. Of course the independence assumption will often fail (especially in the case of functional data). However, as shown in Ding and Peng (2005), this rule works as a heuristic which sometimes offers a surprisingly good practical performance.

• The k-Nearest Neighbors classifier (k-NN). According to this method (already commented on in the introduction of the paper) a new observation is assigned to the class of the majority of its k closest neighbors. We use the usual Euclidean distance (or $L^2$-distance when the method is used with the complete curves) to define the neighbors. The parameter k is fitted through the validation step, as explained below.

• Linear Discriminant Analysis (LDA). The classic Fisher’s linear discriminant is, still today, the most popular classification method among practitioners. It is known to be optimal under gaussianity and homoscedasticity of the distributions in both populations but, even when these conditions are not fulfilled, LDA tends to show a good practical performance in many real data sets. See, e.g., Hand (2006).

• Support Vector Machine (SVM). This is one of the most popular classification methodologies of the last two decades. The basic idea is to look for the ‘best hyperplane’ in order to maximize the separation margin between the two classes. The use of different kernels (to send the observations to higher dimensional spaces where the separation is best achieved) is the most distinctive feature of this procedure. As in Ding and Peng (2005) we have used linear kernels.

As an objective reference, our simulation outputs also include the percentages of correct classification obtained with those classifiers based on the complete curves, i.e., when no variable selection is done at all (except for LDA, whose functional version is not feasible; see Baíllo et al. (2011b)). This reference method is called Base. A somewhat surprising conclusion of our study is that this Base method is often outperformed by the variable selection procedures. This could be due to the fact that the whole curves are globally more affected by noise than the selected variables. Thus, variable selection is beneficial not only in terms of simplicity but also in terms of accuracy.


4.2 Computational details 

All codes have been implemented in MATLAB and are available from the authors 
upon request. We have used our own code for k -NN and LDA (which is a faster 
implementation of the MATLAB function classify). The Naive Bayes classifier is 
based on the MATLAB functions NaiveBayes.fit and predict. The linear SVM has 
been performed with the MATLAB version of the LIBLINEAR library (see Fan et 
al. (2008)) with bias and solver type 2, which obtains (with our data) very similar 
results to those of the default solver type 1 but faster. The mRMR method has 
been implemented in such a way that different association measures can be used to 
define it. An online implementation of the original mRMR method can be found in 
http://penglab.janelia.org/proj/mRMR/ . 


Following Ding and Peng (2005), the criteria (1) and (2) to assess relevance and redundancy, respectively, are in fact replaced by approximate expressions, numbered (6) and (7) in Ding and Peng (2005): as these authors point out, their expression (6) is equivalent to the relevance criterion (1) while (7) provides an approximation for the minimum redundancy criterion (2). The empirical estimation of the distance covariance (and distance correlation) implemented is the one proposed in Székely et al. (2007), expression (2.8).
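For orientation, one common way of writing such an incremental approximation for the difference criterion is given below, in the notation of Section 2. This is a sketch stated under the assumption that it corresponds to combining expressions (6) and (7) of Ding and Peng (2005), which are not reproduced in the present paper: given the current set $S$, the next variable $t^{*}$ is chosen among the not yet selected ones by comparing its own relevance with its average association with the members of $S$.

```latex
t^{*} \;=\; \arg\max_{t \notin S}
\left[\, I(X_t, Y) \;-\; \frac{1}{\mathrm{card}(S)} \sum_{s \in S} I(X_t, X_s) \right]
```

For the quotient criterion the difference inside the brackets would be replaced by the corresponding ratio.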

All the functional simulated data are discretized to $(x(t_1), \ldots, x(t_{100}))$, where the $t_i$ are equi-spaced points in [0,1]. There is a partial exception in the case of the Brownian-like model, where (to avoid the degeneracy $x(t_0) = 0$) we take $t_1 = 5/105$. Also (for a similar reason), a truncation is done at the end of the interval [0,1] in those models including the Brownian Bridge.

The number k of nearest neighbours in the k-NN classifier, the cost parameter C of the SVM classifier and the number of selected variables are chosen by standard validation procedures; see Guyon et al. (2006). To this end, in the simulation study, we have generated independent validation and test samples of size 200. Each simulation output is based on 200 independent runs.
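A validation step of this kind can be sketched as follows for the k-NN classifier: the number of selected variables and the number of neighbours are chosen jointly as the pair with the highest accuracy on an independent validation sample. The function names, the candidate grid for k and the argument `ranked_vars` (variable indices in the order produced by some selection method) are hypothetical; the source does not describe the exact search that was used.

```python
import numpy as np


def knn_predict(X_train, y_train, X_new, k):
    """Majority vote among the k nearest training points (Euclidean)."""
    d = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest].mean(axis=1)
    return (votes > 0.5).astype(int)  # ties go to class 0


def tune_by_validation(X_tr, y_tr, X_val, y_val, ranked_vars,
                       ks=(1, 3, 5, 7), max_vars=10):
    """Return (validation accuracy, number of variables, k) of the best pair."""
    best = (-1.0, None, None)
    for m in range(1, min(max_vars, len(ranked_vars)) + 1):
        cols = list(ranked_vars[:m])  # first m selected variables
        for k in ks:
            pred = knn_predict(X_tr[:, cols], y_tr, X_val[:, cols], k)
            acc = float((pred == y_val).mean())
            if acc > best[0]:  # keep the first (smallest) best setting
                best = (acc, m, k)
    return best
```

The final performance would then be measured on a separate test sample, so that the tuned parameters are not evaluated on the data used to choose them.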
4.3 A few numerical outputs from the simulations

We present here just a small sample of the entire simulation outputs, which can be downloaded from www.uam.es/antonio.cuevas/exp/mRMR-outputs.xlsx . Some additional results, including a complete list of the considered models, can be found in the Supplemental material file.

Tables 1-4 contain the results obtained with NB, k-NN, LDA and SVM, respectively. The boxed outputs in these tables correspond to the winner and second best method in each row. The column headings (MID, FCD, etc.) correspond to the different mRMR methods based on different association measures, as defined in Section 3 (see the respective notations at the end of that section). The added letter ‘D’ refers to the fact that the global criterion to be maximized is just the difference between the measures (1) and (2) of relevance and redundancy, respectively. There are other possibilities to combine (1) and (2). One could take, for instance, the quotient. The corresponding methods are denoted MIQ, FCQ, etc. in our supplementary material files. However, the outputs are not given here for the sake of brevity. In any case, our results suggest that the difference-based methods are globally (although not uniformly) better than those based on quotients. The column ‘Base’ gives the results when no variable selection method is used (that is, the entire curves are considered). This column does not appear when the LDA method is used, since LDA cannot directly work on functional data.

The row entries ‘Average accuracy’ provide the average percentage of correct classification over the 100 considered model outputs; recall that every output is in turn obtained as an average over 200 independent runs. The rows ‘Average dim. red.’ provide the average numbers of selected variables. The average number of times that every method beats the ‘Base’ benchmark procedure is given in ‘Victories over Base’.
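The summary rows can be computed from a matrix of per-model accuracies as in the sketch below. The function name and data layout are hypothetical, and counting a ‘victory’ as a strictly higher accuracy than Base on a given model is an assumption about how the comparison is made.

```python
import numpy as np


def summarize(acc, base):
    """Summaries over models for one sample size and one classifier.

    acc  : (n_models, n_methods) accuracies of the selection methods.
    base : (n_models,) accuracies of the Base method (entire curves).
    """
    acc = np.asarray(acc, dtype=float)
    base = np.asarray(base, dtype=float)
    return {
        # Column means: the 'Average accuracy' row.
        "average_accuracy": acc.mean(axis=0),
        # Number of models where a method strictly beats Base.
        "victories_over_base": (acc > base[:, None]).sum(axis=0),
    }
```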

It can be seen from these results that the global winner is the R-based mRMR method, with an especially good performance for small sample sizes. Note that the number of variables required by this method is also smaller, in general, than that of the remaining methods. Moreover, RD is the most frequent winner with respect to the Base method (with all classifiers) keeping, in addition, a more stable general performance when compared with the other variable selection methods. In this sense, R-based methods seem both efficient and reliable. In agreement with the results in Ding and Peng (2005), the performance of the FC-based method is relatively poor. Finally, note that the Base option (which uses the entire curves) is never the winner, with the partial exception of the SVM classifier.

4.4 Ranking the methods 

It is not easy to draw general conclusions, and clear recommendations for practition¬ 
ers, from a large simulation study. A natural idea is to give some kind of quantitative 
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Table 1: Performance outputs for the considered methods, using NB and the difference 
criterion, with different sample sizes. Each output is the result of the 100 different 
models for each sample size. 


Output (NB) 

Sample size 

MID 

FCD 

RD 

VD 

CD 

Base 

Average accuracy 

n = 30 

78.08 

78.42 

79.56 

79.24 

79.28 

77.28 


n = 50 

79.64 

79.34 

80.92 

80.45 

80.46 

78.29 


n = 100 

80.76 

80.06 

81.90 

81.34 

81.41 

78.84 


n = 200 

81.46 

80.44 

82.55 

81.90 

82.05 

79.13 


Average dim. red 

n = 30 

n = 50 

n = 100 

n = 200 

8.7 

7.9 

7.2 

6.6 

9.3 

9.0 

8.5 

8.1 

7.2 

6.8 

6.3 

5.8 


7.1 

6.7 

6.2 

5.7 

7.8 

7.4 

6.8 

6.4 

100 

100 

100 

100 

Victories over Base 

n = 30 

57 

61 

77 


71 

69 

- 


n = 50 

66 

61 

79 


74 

70 

- 


n = 100 

77 

61 

88 

81 

85 

- 


n = 200 

84 

62 

93 

85 

91 

- 


Table 2: Performance outputs for the considered methods, using k-NN and the difference
criterion, with different sample sizes. Each output is the result of the 100
different models for each sample size.

Output (k-NN)         Sample size    MID     FCD     RD      VD            CD      Base
Average accuracy      n = 30         80.09   79.26   81.30   80.54         80.40   78.98
                      n = 50         81.43   79.91   82.44   81.47         81.33   80.34
                      n = 100        83.01   80.76   83.82   82.54         82.32   81.99
                      n = 200        84.28   81.34   84.89   83.37         83.15   83.38
Average dim. red      n = 30         9.2     9.8     7.7     8.3           8.0     100
                      n = 50         9.3     9.9     7.9     8.5           8.1     100
                      n = 100        9.6     10.2    8.2     8.7           8.3     100
                      n = 200        9.8     10.4    8.5     [illegible]   8.7     100
Victories over Base   n = 30         71      51
                      n = 50         71      45
                      n = 100        71      38
                      n = 200        73      33

Two further columns of victories over Base (for n = 30, 50, 100, 200), whose labels
among RD, VD and CD are not recoverable from the source: 72, 70, 60, 56 and
69, 68, 65, 58.

Table 3: Performance outputs for the considered methods, using LDA and the difference
criterion, with different sample sizes. Each output is the result of the 100
different models for each sample size.

Output (LDA)          Sample size    MID     FCD     RD      VD      CD      Base

Average accuracy      n = 30         78.72   76.87   79.35   78.23   78.37



n = 50 6.5 5.9 [5(9] [KK] 6.1 

n = 100 7.9 7.5 7A_ _C8_ 7.4 

n = 200 _ 9.0 8.9 [Ko] [sTo] 8.3 


Table 4: Performance outputs for the considered methods, using SVM and the difference
criterion, with different sample sizes. Each output is the result of the 100
different models for each sample size.

Output (SVM)          Sample size

Average accuracy      n = 30
                      n = 50
                      n = 100
                      n = 200
Average dim. red      n = 30
                      n = 50


79.41 81.50 80.35 80.51 

80.01 82.45 81.00 81.20 

80.75 83.45 81.77 82.00 

81.27 84.22 82.38 82.61 
assessment summarizing the relative merits of the different procedures. Many different
ranking criteria might be considered. Following Berrendero et al. (2014), we have
considered here the following ones:


• Relative ranking: for each considered model and sample size, the winning
method (in terms of classification accuracy) gets 10 score points and the method
with the worst performance gets 0 points. The score of any other method, with
performance u, is defined by 10(u - w)/(W - w), where W and w denote,
respectively, the performances of the best and the worst method.

• Positional ranking: the winner gets 10 points, the second best gets 9, etc.

• F1 ranking: the scores are assigned according to the current criteria in a
Formula 1 Grand Prix: the winner gets 25 score points and the following ones
get 18, 15, 10, 8, 6, and 4 points.
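
As an illustration only (this is not the code used in the study), the three scoring
rules above can be sketched in Python for a single model and sample size. The function
names, the tie-breaking rule and the accuracies in the usage note are hypothetical; the
F1 point list is the one quoted in the text.

```python
from typing import Dict, List

# Points for the F1 ranking exactly as listed in the text (winner first).
F1_POINTS: List[int] = [25, 18, 15, 10, 8, 6, 4]


def relative_scores(acc: Dict[str, float]) -> Dict[str, float]:
    """Relative ranking: 10 * (u - w) / (W - w), with W the best and w the worst accuracy."""
    best, worst = max(acc.values()), min(acc.values())
    if best == worst:
        # Assumption: if all methods tie, every method receives the top score.
        return {method: 10.0 for method in acc}
    return {method: 10.0 * (u - worst) / (best - worst) for method, u in acc.items()}


def _ordered(acc: Dict[str, float]) -> List[str]:
    # Methods sorted from best to worst; ties are broken by name (an assumption).
    return sorted(acc, key=lambda method: (-acc[method], method))


def positional_scores(acc: Dict[str, float]) -> Dict[str, int]:
    """Positional ranking: 10 points for the winner, 9 for the second best, etc."""
    return {method: 10 - i for i, method in enumerate(_ordered(acc))}


def f1_scores(acc: Dict[str, float]) -> Dict[str, int]:
    """F1 ranking: points taken from F1_POINTS by position (0 beyond the list)."""
    return {method: (F1_POINTS[i] if i < len(F1_POINTS) else 0)
            for i, method in enumerate(_ordered(acc))}
```

For instance, with the made-up accuracies `{"MID": 78.0, "FCD": 76.0, "RD": 80.0, "VD": 79.0, "CD": 77.0}`,
RD gets 10 relative points, FCD gets 0 and MID gets 10(78 - 76)/(80 - 76) = 5. The global
scores reported in the tables would then be averages of such per-model scores.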

The summary results are shown in Tables 5-8 and a visual version of the complete
(400 experiments) relative ranking outputs for the k-NN classifier is displayed in
Figure 2 (analogous figures for the other classification methods can be found in the
Supplemental material document). The conclusions are self-explanatory and quite
robust with respect to the ranking criterion. The mRMR methods based on the
distance correlation measure are the uniform global winners. The results confirm the
relative stability of R, especially when compared with MI, whose good performance is
restricted to a few models.

Of course, the criteria for defining these rankings, as well as the idea of averaging 
over different models, are questionable (although one might think of a sort of Bayesian 
interpretation for these averages). Anyway, this is the only way we have found to 
provide an understandable summary for such a large empirical study. On the other 
hand, since we have made available the whole outputs of our experiments, other 
different criteria might be used by interested readers. 


5 Real data examples 


We have chosen three real-data examples on the basis of their popularity in the
literature on Functional Data Analysis: we call them Growth (93 growth curves in
boys and girls), Tecator (215 near-infrared absorbance spectra from finely chopped
meat) and Phoneme (1717 log-periodograms corresponding to the pronunciation of
the sounds ‘aa’ and ‘ao’). The respective dimensions of the considered discretizations
for these data are 31, 100 and 256. The second derivatives are used for the Tecator
data. There are many references dealing with these data sets so we will omit here a
detailed description of them. See, for example, Ramsay and Silverman (2005), Ferraty
and Vieu (2006) and Hastie et al. (2005), respectively, for additional details.

The methodology followed in the treatment of these data sets is similar to that
followed in the simulation study, with a few technical differences. For Tecator and
Table 5: Global scores of the considered methods under three different ranking criteria
using NB. Each output is the average of 100 models.

Table 6: Global scores of the considered methods under three different ranking criteria
using k-NN. Each output is the average of 100 models.
Ranking criterion (k-NN) Sample size MID FCD RD VD CD

Table 7: Global scores of the considered methods under three different ranking criteria
using LDA. Each output is the average of 100 models.
Ranking criterion (LDA) Sample size MID FCD RD VD CD

Table 8: Global scores of the considered methods under three different ranking criteria
using SVM. Each output is the average of 100 models.
Ranking criterion (SVM) Sample size MID FCD RD VD CD
[Figure 2; horizontal axis: simulation experiments]

Figure 2: Chromatic version of the global relative ranking table taking into account
the 400 considered experiments (columns) and the difference-based mRMR versions
with the k-NN classifier: the darker the better.


Growth data sets, a standard leave-one-out cross-validation is used. Such a procedure
turns out to be too expensive (in computational terms) for the Phoneme data set.
So in this case we have carried out 50-fold cross-validation; see, for example, Hastie
et al. (2005, Sec. 7.10) for related ideas.
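
The two validation schemes just mentioned differ only in the number of folds, since
leave-one-out is k-fold cross-validation with k equal to the sample size. The following
Python sketch shows one possible way to build the folds and to compute the
cross-validated accuracy; it is an illustration under our own assumptions (random
shuffling with a fixed seed, fold sizes differing by at most one), and `fit_predict` is a
hypothetical callable standing for any classifier.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple


def kfold_indices(n: int, k: int, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    With k == n this reduces to leave-one-out cross-validation.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    base, extra = divmod(n, k)  # the first `extra` folds get one more observation
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        start += size
        yield train, test


def cv_accuracy(X: Sequence, y: Sequence, fit_predict: Callable, k: int, seed: int = 0) -> float:
    """Proportion of held-out observations that are correctly classified."""
    hits = 0
    for train, test in kfold_indices(len(y), k, seed):
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test])
        hits += sum(p == y[i] for p, i in zip(preds, test))
    return hits / len(y)
```

With the sample sizes quoted above, `kfold_indices(93, 93)` gives the 93 leave-one-out
splits for Growth, whereas `kfold_indices(1717, 50)` gives 50 folds of 34 or 35 curves
for Phoneme, so the classifier is fitted 50 times instead of 1717.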


A summary of the comparison outputs obtained for these data sets using the different
mRMR criteria (as well as the benchmark ‘Base’ comparison, with no variable
selection) is given in Table 9. Again, the letter D in MID, FCD, etc. indicates that the
relevance and redundancy measures are combined by difference. The analogous outputs
using the quotient (instead of the difference) can be found in the Supplemental
material file.
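
To make the difference and quotient combinations concrete, here is a minimal Python
sketch of a greedy mRMR selection. It is only an illustration: the association measure
is passed as a function, the absolute Pearson correlation is used as a simple stand-in,
redundancy is taken as the average association with the already selected variables, and
the small constant in the quotient is our own guard against division by zero. All names
are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def abs_pearson(u: Vector, v: Vector) -> float:
    """Absolute Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    if suu == 0 or svv == 0:
        return 0.0
    return abs(suv) / math.sqrt(suu * svv)


def mrmr(columns: List[Vector], y: Vector, n_select: int,
         assoc: Callable[[Vector, Vector], float] = abs_pearson,
         criterion: str = "difference") -> List[int]:
    """Greedy mRMR: indices of the selected columns, in order of selection."""
    relevance = [assoc(col, y) for col in columns]
    # The first variable is simply the most relevant one.
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < min(n_select, len(columns)):
        best, best_score = -1, -math.inf
        for j in range(len(columns)):
            if j in selected:
                continue
            redundancy = sum(assoc(columns[j], columns[s]) for s in selected) / len(selected)
            if criterion == "difference":
                score = relevance[j] - redundancy
            else:  # "quotient"
                score = relevance[j] / (redundancy + 1e-12)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

In a toy example with columns x0 = (-1, -1, 1, 1), x1 = (-4, -2, 4, 2), x2 = (-1, 1, -1, 1)
and response y = (-5, -1, 1, 5), the variable x1 is more relevant than x2 but almost a
copy of x0, so both criteria select x0 first and x2 second.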

The conclusions are perhaps less clear than those in the simulation study. The lack 
of a uniform winner is apparent. However, the R-based method is clearly competitive 
and might even be considered as the global winner, taking into account both accuracy
and the amount of dimension reduction. The Tecator outputs are particularly remarkable
since RD and VD provide the best results (with three different classifiers) using just 
one variable. Again, variable selection methods beat here the ‘Base’ approach (except 
for the Growth example) in spite of the drastic dimension reduction provided by the 
mRMR methods. 

6 Final conclusions and comments 


The mRMR methodology has become an immensely popular tool in the machine
learning and bioinformatics communities. For example, the papers by Ding and Peng
(2005) and Peng et al. (2005) had 819 and 2430 citations, respectively, on Google

Table 9: Performances of the different mRMR methods in three data sets. From
top to bottom, the tables stand for Naive Bayes, k-NN, LDA and linear SVM outputs,
respectively.


NB outputs 


Output Data MID FCD RD VD CD Base


Classification accuracy Growth 92.47 

87.10 

97.67 

89.25 

87.10 86.02 84.95 

Tecator 98.60 

99.53 

: r 

99.53 

| 98.14 97.21 

Phoneme 79.03 J 

80.27 


80.49 

79.39 80.14 74.08 

Number of variables Growth 2.0 

Tecator 2.0 

Phoneme 12.6 

1.1 

5.9 


2.2 

1.0 


1.0 

1.0 

1.3 31 

3.3 100 

15.9 256 

10.3 

15.8 

5.8 


k- NN outputs 


Output Data MID FCD 


Classification accuracy 

Growth 

95.70 


83.87 

99.07 

80.48 

Tecator 

Phoneme 

99.07 

80.14 

Number of variables 

Growth 

3.5 

1.0 


Tecator 

5.7 

3.0 


Phoneme 

15.4 

13.3 


RD VD 


94.62 91.40 


99.53 


99.53 

81.14 

80.31 


2.5 4.8 


1.0 


1.0 


CD 


84.95 

99.07 


I 

80.55 

1 


1.1 



4.0 


17.7 


16.5 


10.7 


Base 

96.77 

98.60 

78.80 

31 

100 

256 


LDA outputs 


Output 


Data 


MID FCD 


RD 


VD 


CD 


Classification accuracy Growth 

94.62 

91.40 

94.62 


94.62 

Tecator 

95.81 

93.95 

94.88 

95.81 

Phoneme 

79.50 

79.34 

79.21 

79.39 


Base 


89.25 

94.88 


79.98 


Number of variables 


Growth 


3.4 


5.0 


3.1 


4.2 


5.0 


Tecator 

2.6 

8.8 

1_1 

5.6 

5.0 


5.0 

- 

Phoneme 

19.1 

8.8 

14.6 

17.1 

12.0 

1 - 


SVM outputs 


Output 

Data 

MID 

FCD 

RD 

VD 

CD 

Base 

Classification accuracy 

Growth 

94.62 

87.10 

94.62 

95.70 

86.02 

95.70 


Tecator 

98.14 

99.07 

99.53 


99.53 

98.60 

99.07 


Phoneme | 

80.90 

80.83 

80.67 

80.78 

80.67 

80.96 


Number of variables 

Growth 

3.4 

5.0 

2.5 

4.2 


Tecator 

6.7 

2.0 

1.0 

1.0 


Phoneme 

18.5 

8.6 

16.2 

16.7 

16.0 

256 
Scholar (by October 2, 2014). As we have mentioned, these authors explicitly pointed
out the need for further research, in order to get improved versions of the mRMR
method. The idea would be to keep the basic mRMR paradigm but using other
association measures (besides the mutual information). This paper exactly follows
such a line of research, with a particular focus on the classification problems involving
functional data.

We think that the results are quite convincing: our extensive simulation study
(based on 1600 simulation experiments and real data) places the mRMR method
based on the R association measure by Szekely et al. (2007) globally above the
original versions of the mRMR paradigm. This is perhaps the main conclusion of our
work. The good performance of the distance correlation in comparison with the
other measures can be partially explained by the facts that this measure captures
non-linear dependencies (unlike C and FC), has a simple smoothing-free empirical
estimator (unlike MI) and is normalized (unlike V).

There are, however, some other more specific comments to be made. 


1. First of all, variable selection is worthwhile in functional data analysis. Accuracy
can be kept (and often improved) using typically fewer than 10% of the original
variables, with the usual benefits of dimension reduction. This phenomenon
happens for all the considered classifiers.

2. The average number of selected variables with the R- or V-based methods is also
smaller than that of MI and FC (that is, the standard mRMR procedures). This
entails an interpretability gain: the fewer the selected variables, the stronger the case
for interpreting the meaning of such a selection in the context of the considered
problem.

3. The advantage of the R-based methods over the remaining procedures is more
remarkable for the case of small sample sizes. This looks like a promising conclusion
since small samples are very common in real problems (e.g. in biomedical
research).


4. In those problems involving continuous variables there is a case for using
non-parametric kernel density estimators in the empirical approximation of the mutual
information criterion. However, these estimators are known to be highly
sensitive to the selection of the smoothing parameter, which can be seen as an
additional unwelcome complication. On the other hand, the results reported so
far (e.g. in Peng et al. (2005)) do not suggest that kernel estimators will lead to
a substantial improvement over the simplest, much more popular discretization
estimators (see e.g. Mandal and Mukhopadhyay (2014); Nguyen et al. (2014)).


5. Still in connection with the previous remark, it is worth noting the lack of
smoothing parameters in the natural estimators of V and R (Szekely et al., 2007).
This can be seen as an additional advantage of the R- or V-based mRMR
method.

6. The better performance of R when compared with V can be explained by the
fact that R is normalized so that relevance (1) and redundancy (2) are always
measured ‘in the same scale’. Otherwise, one of these two quantities could be
overrated by the mRMR algorithm, especially when the difference criterion is
used.

7. The method FCD (sometimes suggested in the literature as a possible good 
choice) does not appear to be competitive. It is even defeated by the simple 
correlation-based method CD. 


8. In general, the difference-based methods are preferable to their quotient-based
counterparts. The quotient-based procedures are only slightly preferable when
combined with methods (FC, V) where relevance and redundancy are expressed
in different scales. The outputs for these quotient-based methods can be found
in the complete list of results www.uam.es/antonio.cuevas/exp/mRMR-outputs.xlsx,
and a summary is available in the Supplemental material document.


9. We should emphasize again that the goal of this paper is to propose new versions
of the mRMR method and to compare them with the standard ones.
Therefore, a wider study involving comparisons with other dimension reduction
methods is beyond the scope of this work. The recent paper by Berrendero
et al. (2014) includes a study of this type (always in the functional setting)
whose conclusions suggest that mRMR might be slightly outperformed by the
Maxima-Hunting (MH) procedure proposed by these authors. It also has a
very similar performance to that of Partial Least Squares (PLS), although PLS
is harder to interpret. Moreover, the number of variables selected by MH is
typically smaller than that required by mRMR.


10. Finally, if we had to choose just one among the considered classification methods,
we should probably take k-NN. The above commented advantages in terms
of ease of implementation and interpretability do not entail any significant price
in efficiency.
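
The smoothing-free nature of the estimators of V and R mentioned in the comments
above can be seen in the following Python sketch of the usual empirical distance
covariance and distance correlation for two scalar samples. It is an illustrative
implementation written for this note, not the code used in the experiments; the
function names are ours and the quadratic-time computation is chosen for clarity, not
speed.

```python
import math
from typing import List, Sequence


def _centered_distances(x: Sequence[float]) -> List[List[float]]:
    """Double-centered matrix of the pairwise distances |x_k - x_l|."""
    n = len(x)
    d = [[abs(a - b) for b in x] for a in x]
    row = [sum(r) / n for r in d]  # row means (equal to column means by symmetry)
    grand = sum(row) / n
    return [[d[k][l] - row[k] - row[l] + grand for l in range(n)] for k in range(n)]


def distance_covariance_sq(x: Sequence[float], y: Sequence[float]) -> float:
    """Empirical squared distance covariance: average of the products A_kl * B_kl."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    return sum(a[k][l] * b[k][l] for k in range(n) for l in range(n)) / n ** 2


def distance_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Empirical distance correlation, a normalized value in [0, 1]."""
    vxy = distance_covariance_sq(x, y)
    vxx = distance_covariance_sq(x, x)
    vyy = distance_covariance_sq(y, y)
    if vxx * vyy == 0:
        return 0.0
    return math.sqrt(max(vxy, 0.0) / math.sqrt(vxx * vyy))
```

No bandwidth or discretization has to be chosen at any point. As a small check of the
ability to capture non-linear dependence, for x = (-2, -1, 0, 1, 2) and y = x^2 the
Pearson correlation is zero, whereas the distance correlation is roughly 0.5.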
The Supplemental material document contains: the complete list and description of
all functional models, the summary Tables 1-9 with the quotient criterion instead
of the difference one, figures analogous to Figure 2 with NB, LDA and SVM, and
some new tables with a few simulation results. All outputs (with both difference and
quotient criteria) of the 1600 simulation experiments and real data can be found at
www.uam.es/antonio.cuevas/exp/mRMR-outputs.xlsx.

Supplementary material for the paper “The mRMR variable

7 List of models used in the simulation study and a few example outputs


Our simulation study consists of 400 experiments based on 100 different underlying
models. The optimal classification rule in each case depends only on a finite number
of variables. Models differ in complexity and number of relevant variables. The
processes involved are chosen among the following: first, the standard Brownian
Motion, $B$. Second, $BT$ denotes a Brownian Motion with a trend $m(t)$, i.e.,
$BT(t) = B(t) + m(t)$; we have considered several choices for $m(t)$: a linear trend,
$m(t) = ct$, a linear trend with random slope, i.e., $m(t) = \theta t$, where $\theta$ is a Gaussian
r.v., and different members of two parametric families: the peak functions $\Phi_{m,k}$ and
the hillside functions, defined by

$$\Phi_{m,k}(t) = \int_0^t \varphi_{m,k}(s)\,ds, \qquad \mathrm{hillside}_{t_0,b}(t) = b(t - t_0)\, I_{[t_0,\infty)}(t),$$

where

$$\varphi_{m,k}(t) = \sqrt{2^{m-1}} \left[ I_{\left(\frac{2k-2}{2^m}, \frac{2k-1}{2^m}\right)}(t) - I_{\left(\frac{2k-1}{2^m}, \frac{2k}{2^m}\right)}(t) \right]$$

for $m \in \mathbb{N}$, $1 \le k \le 2^{m-1}$. Third, the Brownian Bridge: $BB(t) = B(t) - tB(1)$.
Our fourth class of Gaussian processes is the Ornstein-Uhlenbeck process, with zero
mean ($OU$) or different mean functions $m(t)$ ($OUt$). Finally, some “smooth” processes
have also been included. They are obtained by convolving Brownian trajectories with
Gaussian kernels. We have considered two levels of smoothing denoted by sB and ssB.
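
For readers who wish to reproduce trajectories of this kind, the following Python
sketch simulates a discretized Brownian motion, derives the Brownian bridge from it and
evaluates the peak and hillside trends. It is a hedged illustration, not the simulation
code of the study: the grid t_j = j/n with n = 100 is an assumption, the peak function
is coded as the integral of a Haar-type function supported on
((2k-2)/2^m, 2k/2^m) with height sqrt(2^(m-1)), and all function names are
hypothetical.

```python
import math
import random
from typing import List, Optional


def brownian_path(n: int = 100, rng: Optional[random.Random] = None) -> List[float]:
    """Standard Brownian motion B(t) on the grid t_j = j/n, j = 1, ..., n."""
    rng = rng or random.Random()
    path, value = [], 0.0
    for _ in range(n):
        value += rng.gauss(0.0, math.sqrt(1.0 / n))  # independent N(0, 1/n) increments
        path.append(value)
    return path


def brownian_bridge(b: List[float]) -> List[float]:
    """BB(t) = B(t) - t B(1), evaluated on the same grid as the path b."""
    n = len(b)
    return [b[j] - (j + 1) / n * b[-1] for j in range(n)]


def peak(m: int, k: int, t: float) -> float:
    """Peak function: integral from 0 to t of the Haar-type function with indices (m, k)."""
    left, mid, right = (2 * k - 2) / 2 ** m, (2 * k - 1) / 2 ** m, (2 * k) / 2 ** m
    height = math.sqrt(2 ** (m - 1))
    if t <= left or t >= right:
        return 0.0
    # Linear increase up to the midpoint, then linear decrease back to zero.
    return height * (t - left) if t <= mid else height * (right - t)


def hillside(t0: float, b: float, t: float) -> float:
    """Hillside function b (t - t0) for t >= t0, and 0 before t0."""
    return b * (t - t0) if t >= t0 else 0.0
```

A trajectory of a process with a peak trend, such as B(t) plus three times the peak
function with m = 2 and k = 2, is then obtained by adding `3 * peak(2, 2, (j + 1) / n)`
to the j-th entry of `brownian_path(n)`.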

In the following list of models, $\mu_i$ denotes the distribution of X|Y = i and "variables"
is the set of relevant variables in each Gaussian or Mixture case. We call them
“relevant” in the sense that the optimal classification rule depends only on these
variables. In the list below the variables written in boldface are “especially relevant”
regarding their influence in the optimal classifier.


1. Gaussian models considered:

1. G1:  $\mu_0$: $B(t)$
        $\mu_1$: $B(t) + \theta t$, $\theta \sim N(0, 3)$
        variables = $\{X_{100}\}$.

2. G1b: $\mu_0$: $B(t)$
        $\mu_1$: $B(t) + \theta t$, $\theta \sim N(0, 5)$
        variables = $\{X_{100}\}$.
2. Logistic-type models considered: they are all defined according to standard (ii) (see
Sec. 4 in the main paper). The process $X = X(t)$ follows one of the distributions mentioned
above and $Y = \mathrm{Binom}(1, \eta(X))$ with $\eta(x) = (1 + e^{-\psi(x(t_1), \ldots, x(t_k))})^{-1}$, a function of the
relevant variables $x(t_1), \ldots, x(t_k)$.

L1: $\psi(X) = 10X_{65}$.

L2: $\psi(X) = 10X_{30} + 10X_{70}$.

L3: $\psi(X) = 10X_{30} - 10X_{70}$.

L4: $\psi(X) = 20X_{30} + 50X_{50} + 20X_{80}$.

L5: $\psi(X) = 20X_{30} - 50X_{50} + 20X_{80}$.

L6: $\psi(X) = 10X_{10} + 30X_{40} + 10X_{72} + 10X_{80} + 20X_{95}$.

L7: [formula partly illegible in the source: 10X 1Oi]

L8: [formula partly illegible in the source: 20X | 0 + 10X| 0 + 50X| 0]

L9: [formula partly illegible in the source: IOX 10 + 10|X,5o| + OX| 0 X85]

L10: $\psi(X) = 20X_{33} + 20|X_{68}|$.

L11: [formula illegible in the source]

L12: $\psi(X) = \log X_{35} + \log X_{77}$.

L13: $\psi(X) = 40X_{20} + 30X_{28} + 20X_{52} + 10X_{57}$.

L14: $\psi(X) = 40X_{20} + 30X_{28} - 20X_{52} - 10X_{57}$.

L15: $\psi(X) = 40X_{20} - 30X_{28} + 20X_{52} - 10X_{57}$.

Some variations of these models have also been considered:

L3b: $\psi(X) = 30X_{30} - 20X_{70}$.

L4b: $\psi(X) = 30X_{30} + 20X_{50} + 10X_{80}$.

L5b: $\psi(X) = 10X_{30} - 10X_{50} + 10X_{80}$.

L6b: $\psi(X) = 20X_{10} + 20X_{40} + 20X_{72} + 20X_{80} + 20X_{95}$.

L8b: [formula partly illegible in the source: 10 X| 0 + 10 X| 0 + 10 Xf 0]
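
The way a class label is generated in these logistic-type models can be illustrated
with a short Python sketch for model L2. This is only a sketch under our own
conventions: the trajectory is assumed to be stored as a list whose entry j - 1 holds
the variable with index j, the logistic link is evaluated in a numerically stable form,
and the function names are hypothetical.

```python
import math
import random
from typing import Callable, Optional, Sequence


def eta(psi_value: float) -> float:
    """Logistic link (1 + exp(-psi))^(-1), written to avoid overflow for large |psi|."""
    if psi_value >= 0:
        return 1.0 / (1.0 + math.exp(-psi_value))
    z = math.exp(psi_value)
    return z / (1.0 + z)


def psi_L2(x: Sequence[float]) -> float:
    """Model L2: 10 X_30 + 10 X_70, with X_j stored at position j - 1."""
    return 10.0 * x[29] + 10.0 * x[69]


def draw_label(x: Sequence[float],
               psi: Callable[[Sequence[float]], float] = psi_L2,
               rng: Optional[random.Random] = None) -> int:
    """Draw Y from a Bernoulli distribution with success probability eta(psi(x))."""
    rng = rng or random.Random()
    return 1 if rng.random() < eta(psi(x)) else 0
```

Any other model in the list is obtained by replacing `psi_L2` with the corresponding
function of the relevant variables; only those variables influence the label, which is
why the optimal rule depends on them alone.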


3. Mixture-type models: they are obtained by combining (via mixtures) in several ways
the above mentioned Gaussian distributions assumed for X|Y = 0 and X|Y = 1. These
models are denoted M1, ..., M10 in the output tables. Each component of $\mu_0$ is
followed by its mixture weight.

1. M1:   $\mu_0$: $B(t) + 3t$, 1/2
                  $B(t) - 2t$, 1/2
         $\mu_1$: $B(t)$
         variables = $\{X_{100}\}$.

2. M2:   $\mu_0$: $B(t) + 3\Phi_{2,2}(t)$, 1/2
                  $B(t) + 5\Phi_{3,2}(t)$, 1/2
         $\mu_1$: $B(t)$
         variables = $\{X_{22}, X_{35}, X_{48}, X_{75}, X_{100}\}$.

3. M3:   $\mu_0$: $B(t) + 3\Phi_{2,2}(t)$, 1/10
                  $B(t) + 5\Phi_{3,2}(t)$, 9/10
         $\mu_1$: $B(t)$
         variables = $\{X_{22}, X_{35}, X_{48}, X_{75}, X_{100}\}$.

4. M4:   $\mu_0$: $B(t) + 3\Phi_{2,2}(t)$, 1/2
                  $B(t) + 5\Phi_{3,3}(t)$, 1/2
         $\mu_1$: $B(t)$
         variables = $\{X_{48}, X_{62}, X_{75}, X_{100}\}$.

5. M5:   $\mu_0$: $B(t) + 3\Phi_{2,1}(t)$, 1/3
                  $B(t) + 3\Phi_{2,2}(t)$, 1/3
                  $B(t) + 5\Phi_{3,2}(t)$, 1/3
         $\mu_1$: $B(t)$
         variables = $\{X_1, X_{22}, X_{35}, X_{48}, X_{75}, X_{100}\}$.

6. M6:   $\mu_0$: $B(t) + 3\Phi_{2,1}(t)$, 1/2
                  $B(t) + 3t$, 1/2
         $\mu_1$: $B(t)$
         variables = $\{X_1, X_{22}, X_{49}, X_{100}\}$.

7. M7:   $\mu_0$: $B(t) + 3\Phi_{1,1}(t)$, 1/2
                  $BB(t)$, 1/2
         $\mu_1$: $B(t)$
         variables = $\{X_1, X_{48}, X_{100}\}$.

8. M8:   $\mu_0$: $B(t) + \theta t$, $\theta \sim N(0, 5)$, 1/2
                  $B(t) + \mathrm{hillside}_{0.5,5}(t)$, 1/2
         $\mu_1$: $B(t)$
         variables = $\{X_{47}, X_{100}\}$.

9. M9:   $\mu_0$: $B(t) + \theta t$, $\theta \sim N(0, 5)$, 1/2
                  $BB(t)$, 1/2
         $\mu_1$: $B(t)$
         variables = $\{X_{100}\}$.

10. M10: $\mu_0$: $B(t) + 3\Phi_{1,1}(t)$, 1/3
                  $B(t) - 3t$, 1/3
                  $BB(t)$, 1/3
         $\mu_1$: $B(t)$
         variables = $\{X_1, X_{48}, X_{100}\}$.

11. M11: $\mu_0$: $B(t) + 3\Phi_{1,1}(t)$, 1/4
                  $B(t) - 3t$, 1/4
                  $B(t) + \mathrm{hillside}_{0.5,5}(t)$, 1/4
                  $BB(t)$, 1/4
         $\mu_1$: $B(t)$
         variables = $\{X_1, X_{48}, X_{100}\}$.


Finally, the full list of models involved is as follows:

 1. L1 OU       26. L5 OU       51. L9 sB       76. L15 OU
 2. L1 OUt      27. L5b OU      52. L9 ssB      77. L15 OUt
 3. L1 B        28. L5 OUt      53. L10 OU      78. L15 B
 4. L1 sB       29. L5 B        54. L10 B       79. L15 sB
 5. L1 ssB      30. L5 sB       55. L10 sB      80. G1
 6. L2 OU       31. L5 ssB      56. L10 ssB     81. G1b
 7. L2 OUt      32. L6 OU       57. L11 OU      82. G2
 8. L2 B        33. L6b OU      58. L11 OUt     83. G2b
 9. L2 sB       34. L6 OUt      59. L11 B       84. G3
10. L2 ssB      35. L6b OUt     60. L11 sB      85. G4
11. L3 OU       36. L6 B        61. L11 ssB     86. G5
12. L3b OU      37. L6 sB       62. L12 OU      87. G6
13. L3 OUt      38. L6 ssB      63. L12 OUt     88. G7
14. L3b OUt     39. L7 OU       64. L12 B       89. G8
15. L3 B        40. L7b OU      65. L12 sB      90. M1
16. L3b B       41. L7 OUt      66. L12 ssB     91. M2
17. L3 sB       42. L7b OUt     67. L13 OU      92. M3
18. L3 ssB      43. L7 B        68. L13 OUt     93. M4
19. L4 OU       44. L7 sB       69. L13 B       94. M5
20. L4b OU      45. L7 ssB      70. L13 sB      95. M6
21. L4 OUt      46. L8 B        71. L13 ssB     96. M7
22. L4b OUt     47. L8 sB       72. L14 OU      97. M8
23. L4 B        48. L8 ssB      73. L14 OUt     98. M9
24. L4 sB       49. L8b OU      74. L14 B       99. M10
25. L4 ssB      50. L9 B        75. L14 sB     100. M11


Simulation results 

We next provide a few simulation results. See
www.uam.es/antonio.cuevas/exp/mRMR-outputs.xlsx for the full simulation outputs.


NB accuracy outputs 

Model MID FCD RD VD CD Base


L7_0U 

Ll_OUt 

L14J3 

L9_sB 

Gib 

G3 

G6 

Ml 

M4 

M6 


75.89 

84.73 

79.74 
76.22 
83.13 
79.12 
64.73 

81.25 


87.01 

74.92 

77.31 

86.01 

75.94 

64.54 

83.50 

73.09 

68.04 

80.09 


90.41 

74.10 

77.17 

85.84 
80.73 

78.84 
84.38 
80.93 
67.91 
83.06 


87.72 
74.56 

76.72 
85.40 

81.66 
78.78 
83.96 

81.89 
67.48 
82.45 


90.50 

74.26 

76.75 

85.78 

77.67 

72.67 
84.28 
76.77 

67.89 
83.21 


93.35 

73.33 

75.33 

82.27 
79.49 
71.24 

74.90 
77.61 

61.36 
78.02 


Table 10.- Average NB accuracy (proportion of correct classification) outputs, over 200 runs of the 
considered methods with sample size n = 50. 


Model 

L7_OU 

LUOUt 

L14/B 

L9_sB 

Gib 

G3 

G6 

Ml 

M4 

M6 


NB number of variables 


MID FCD 

12.1 14.1 

8.9 8.0 

8.0 
7.5 
13.4 


6.4 

3.4 
4.9 

6.5 5.0 

7.1 7.6 


11.6 

I. 9 

II. 7 


7.9 

4.9 

6.9 


RD 

10T 

7.0 

6.0 

4.7 

4.8 

AT 

3.5 

4.5 
5.1 

5.5 


VD 

CD 

Base 

11.5 

11.1 

100 

5.9 


6.6 

100 

7.5 

5.9 

100 

4.9 

5.3 

100 

1.5 

9.0 

100 

3.7 

8.2 

100 

3.1 

2.7 

100 

1.4 

8.4 

100 

3.4 


4.6 

100 

5.6 

6.4 

100 


Table 11.- Average number of selected variables over 200 runs of the considered methods with sample 
size n = 50 using NB. 
k-NN accuracy outputs


Model 


MID 


FCD 


RD 


YD 


CD 


Base 


L7_0U 

Ll_OUt 

L14J3 

L9_sB 

Gib 

G3 

G6 

Ml 

M4 

M6 


90.74 

75.83 

75.34 

86.79 

79.11 

73.39 

91.95 


81.63 

72.94 

83.08 


77.34 


77.16 


87.39 

75.74 

61.97 

84.16 

75.47 

71.60 

79.70 


86.89 90.78 87.79 


77.29 


87.35 


80.00 


77.28 


88.22 

82.79 


74.72 


84.13 


76.77 77.22 


76.20 

87.03 


80.01 


77.03 


85.80 

83.00 


70.30 


84.26 


90.67 
77.11 
76.49 
87.28 
78.13 
68.20 

84.68 
80.59 
71.66 
84.02 


92.21 


75.81 

74.43 

86.10 

78.57 

65.26 

92.19 


80.72 


73.29 


80.99 


Table 12.- Average k-NN accuracy (proportion of correct classification) outputs, over 200 runs of
the considered methods with sample size n = 50.


k-NN number of variables

Model     MID     FCD     RD      VD      CD      Base
L7_OU     11.5    14.5    10.4    12.1    11.3    100
L1_OUt    9.0     7.9     6.9     6.5     6.8     100
L14_B     8.3     7.6     5.5     8.0     6.5     100
L9_sB     6.3     7.7     6.0     7.1     6.0     100
G1b       7.8     11.7    6.5     6.3     8.7     100
G3        5.1     11.2    2.5     2.9     7.8     100
G6        11.5    12.6    9.0     8.2     7.5     100
M1        7.8     11.4    6.3     4.9     8.6     100
M4        11.7    16.1    10.4    9.7     10.1    100
M6        9.9     9.6     7.5     8.7     7.6     100

Table 13.- Average number of selected variables over 200 runs of the considered methods with sample
size n = 50 using k-NN.
















LDA accuracy outputs

Model     MID     FCD     RD      VD      CD      Base
L7_OU     89.27   85.13   89.75   86.81   90.10
L1_OUt    72.05   73.74   73.48   73.96   73.66
L14_B     75.25   76.35   77.12   75.62   76.27
L9_sB     84.91   84.88   84.96   84.69   85.12
G1b       53.35   52.22   54.15   54.49   51.67
G3        52.26   50.91   53.53   53.51   51.06
G6        95.28   87.92   90.54   86.59   87.80
M1        54.44   53.64   54.88   54.68   53.70
M4        78.95   70.98   76.46   71.37   71.30
M6        79.92   79.53   80.80   80.57   80.70

Table 14.- Average LDA accuracy (proportion of correct classification) outputs, over 200 runs of the
considered methods with sample size n = 50.


LDA number of variables 


Model MID FCD RD VD CD Base


L7_OU 

5.5 

6.9 

Ll.OUt 

5.6 

4.6 

L14J3 

3.6 

3.2 

L9_sB 

4.0 

3.5 

Gib 

5.6 

7.0 

G3 

7.1 

8.9 

G6 

10.7 

i 14.3 

Ml 

5.5 

7.2 

M4 

11.1 

10.2 

M6 

6.4 

4.5 


6.2 

4.6 

3.1 

3.3 

5.3 
jLCT 
11.2 

10.3 

5.2 


6.1 
4 A 

4.1 
3 A 

4.6 
533 

Y .8 

5.7 
7.4 

5.2 


7.2 

4.9 

3.6 
3.5 

7.4 

8.9 

11.6 
6.8 

8.5 
4.8 


Table 15.- Average number of selected variables over 200 runs of the considered methods with sample 
size n = 50 using LDA. 
SVM accuracy outputs


Model MID FCD RD VD CD Base


L7_0U 

Ll.OUt 

L14J3 

L9_sB 

Gib 

G3 

G6 

Ml 

M4 

M6 


75.89 

84.73 

79.74 
76.22 
83.13 
79.12 
64.73 

81.25 


87.01 

74.92 

77.31 

86.01 

75.94 

64.54 

83.50 

73.09 

68.04 

80.09 


90.41 

74.10 

77.17 

85.84 
80.73 

78.84 
84.38 
80.93 
67.91 
83.06 


87.72 
74.56 

76.72 
85.40 

81.66 
78.78 
83.96 

81.89 
67.48 
82.45 


90.50 

74.26 

76.75 

85.78 

77.67 

72.67 
84.28 
76.77 

67.89 
83.21 


93.35 

73.33 

75.33 

82.27 
79.49 
71.24 

74.90 
77.61 

61.36 
78.02 


Table 16.- Average SVM accuracy (proportion of correct classification) outputs, over 200 runs of the 
considered methods with sample size n = 50. 


Model 

L7_OU 

Ll.OUt 

L14J3 

L9_sB 

Gib 

G3 

G6 

Ml 

M4 

M6 


SVM number of variables 

MID FCD RD VD CD Base 


12.1 

8.9 

7.9 

4.9 

6.9 

6.4 

3.4 

4.9 

6.5 
7.1 


14.1 

8.0 

8.0 

7.5 
13.4 
11.6 

I. 9 

II. 7 
5.0 

7.6 


10.1 

7.0 

6.0 

4.7 

4.8 

AL 

3.5 

1.5 
5.1 

5.5 


11.5 

AT 

7.5 
4.9 

IT 

T7 

3.1 

IT 

3.4 

5.6 


11.1 

AT 

5.9 

5.3 
9.0 
8.2 
AT 

8.4 
4.6 

6.4 


100 

100 

100 

100 

100 

100 

100 

100 

100 

100 


Table 17.- Average number of selected variables over 200 runs of the considered methods with sample 
size n = 50 using SVM. 


Results with quotient criterion and other classifiers 

The outputs included in the paper correspond to the difference criterion (in order to
combine the relevance and redundancy measures). We provide here some additional
results for the quotient criterion, instead of the difference one. Besides, Figure 2 in the
paper was produced using only the outputs for the k-NN classifier. In this document
we show the analogous displays for LDA, SVM and NB. Again, the entire simulation
outputs can be downloaded from www.uam.es/antonio.cuevas/exp/mRMR-outputs.xlsx.

Tables 1a-9a below correspond to Tables 1-9 in the main paper with the difference
criterion replaced with the quotient one. Figures 2a, 2b and 2c correspond to Figure
2 using the other classifiers.


Output (NB)           Sample size    MIQ     FCQ     RQ      VQ      CQ      Base
Average accuracy      n = 30         78.10   78.76   79.53   79.58   79.12   77.28
                      n = 50         79.59   79.73   80.86   80.81   80.26   78.29
                      n = 100        80.62   80.54   81.82   81.75   81.16   78.84
                      n = 200        81.24   81.03   82.48   82.35   81.77   79.13
Average dim. red      n = 30         8.9     8.6     7.2     7.0     7.9     100
                      n = 50         8.3     8.1     6.7     6.7     7.4     100
                      n = 100        7.7     7.2     6.1     6.1     6.9     100
                      n = 200        7.1     6.7     5.7     5.9     6.5     100
Victories over Base   n = 30         60      65      72      76      68      -
                      n = 50         67      61      79      78      68      -
                      n = 100        71      64      88      86      79      -
                      n = 200        75      68      92      91      84      -

Table 1a.- Performance outputs for the considered methods, using NB and the quotient criterion,
with different sample sizes. Each output is the result of the 100 different models for each sample
size.


Output (k-NN)         Sample size    MIQ     FCQ     RQ      VQ      CQ      Base
Average accuracy      n = 30         80.02   79.65   80.85   80.82   80.09   78.98
                      n = 50         81.32   80.40   81.72   81.67   80.87   80.34
                      n = 100        82.84   81.34   82.73   82.65   81.83   81.99
                      n = 200        84.06   82.09   83.56   83.49   82.65   83.38
Average dim. red      n = 30         9.3     9.5     7.4     7.6     8.1     100
                      n = 50         9.6     9.6     7.6     7.8     8.3     100
                      n = 100        9.9     9.9     8.0     8.2     8.6     100
                      n = 200        10.1    10.1    8.3     8.5     9.0     100
Victories over Base   n = 30         71      58      76      74      67      -
                      n = 50         67      53      73      72      64      -
                      n = 100        71      49      64      64      55      -
                      n = 200        64      42      62      64      54      -

Table 2a.- Performance outputs for the considered methods, using k-NN and the quotient criterion,
with different sample sizes. Each output is the result of the 100 different models for each sample
size.
Output (LDA)          Sample size    MIQ     FCQ     RQ      VQ      CQ      Base
Average accuracy      n = 30         78.67   77.58   78.64   78.64   78.15   60.80
                      n = 50         80.20   78.53   79.58   79.52   79.11   58.75
                      n = 100        81.74   79.62   80.65   80.52   80.25   53.11
                      n = 200        82.90   80.47   81.53   81.35   81.10   73.25
Average dim. red      n = 30         5.8     4.7     4.7     4.7     5.1     100
                      n = 50         6.9     5.7     5.6     5.6     6.0     100
                      n = 100        8.3     7.1     6.9     7.0     7.3     100
                      n = 200        9.5     8.3     8.0     8.1     8.4     100

Table 3a.- Performance outputs for the considered methods, using LDA and the quotient criterion,
with different sample sizes. Each output is the result of the 100 different models for each sample
size.


Output (SVM)          Sample size    MIQ     FCQ     RQ      VQ      CQ      Base
Average accuracy      n = 30         81.62   79.81   80.69   80.65   80.27   81.91
                      n = 50         82.69   80.42   81.43   81.35   80.96   82.99
                      n = 100        83.80   81.21   82.20   82.12   81.76   84.11
                      n = 200        84.61   81.79   82.90   82.76   82.42   84.91
Average dim. red      n = 30         10.6    10.3    9.1     9.2     9.5     100
                      n = 50         10.7    10.4    9.3     9.4     9.7     100
                      n = 100        11.1    10.5    9.5     9.7     9.9     100
                      n = 200        11.4    10.7    9.7     9.9     10.0    100
Victories over Base   n = 30         32      37      49      47      42      -
                      n = 50         35      34      51      52      44      -
                      n = 100        35      33      51      50      48      -
                      n = 200        33      31      52      51      48      -

Table 4a.- Performance outputs for the considered methods, using SVM and the quotient criterion,
with different sample sizes. Each output is the result of the 100 different models for each sample
size.
Ranking criterion (NB)    Sample size    MIQ     FCQ     RQ      VQ      CQ
Relative                  n = 30         2.27    5.67    8.69    8.63    7.65
                          n = 50         2.69    4.94    9.09    8.70    7.51
                          n = 100        2.75    4.75    9.21    8.80    7.44
                          n = 200        2.71    4.57    8.87    8.31    7.41
Positional                n = 30         6.78    7.83    8.53    8.50    8.39
                          n = 50         6.79    7.57    8.93    8.51    8.22
                          n = 100        6.80    7.47    9.01    8.58    8.14
                          n = 200        6.84    7.56    8.92    8.43    8.25
F1                        n = 30         12.25   15.85   17.39   17.58   17.01
                          n = 50         12.24   14.90   19.19   17.37   16.35
                          n = 100        12.35   14.60   19.67   17.38   16
                          n = 200        12.49   14.92   19.28   16.86   16.45

Table 5a.- Global scores of the considered (quotient-based) methods using three different ranking
criteria with the NB classifier.


Ranking criterion (k-NN)  Sample size     MIQ     FCQ      RQ      VQ      CQ

Relative                  n = 30         3.70    3.97    8.51    8.09    6.03
                          n = 50         4.39    3.72    7.84    7.59    5.61
                          n = 100        4.93    3.46    7.24    6.91    5.15
                          n = 200        5.52    3.00    6.72    6.52    5.00

Positional                n = 30         7.31    7.29    9.06    8.64    7.70
                          n = 50         7.55    7.30    8.90    8.67    7.58
                          n = 100        7.75    7.37    8.82    8.62    7.49
                          n = 200        7.96    7.43    8.56    8.45    7.60

F1                        n = 30        14.32   13.96   19.66   17.80   14.26
                          n = 50        15.27   14.06   18.78   17.94   13.95
                          n = 100       16.02   14.19   18.44   17.90   13.62
                          n = 200       16.84   14.09   17.48   17.57   14.02


Table 6a.- Global scores of the considered (quotient-based) methods using three different ranking 
criteria with the k-NN classifier. 

Ranking criterion (LDA)   Sample size     MIQ     FCQ      RQ      VQ      CQ

Relative                  n = 30         3.94    3.14    7.48    7.38    5.38
                          n = 50         4.26    2.86    6.97    6.66    5.19
                          n = 100        4.76    2.60    6.98    6.43    5.33
                          n = 200        5.27    2.36    6.78    6.13    5.23

Positional                n = 30         7.49    6.99    9.01    8.96    7.55
                          n = 50         7.64    7.12    8.90    8.64    7.70
                          n = 100        7.72    7.13    8.89    8.52    7.74
                          n = 200        7.80    7.23    8.79    8.36    7.82

F1                        n = 30        15.05   12.67   19.11   19.32   13.85
                          n = 50        15.63   12.95   18.91   18.07   14.44
                          n = 100       15.80   13.04   18.83   17.72   14.61
                          n = 200       16.25   13.29   18.62   17.04   14.80


Table 7a.- Global scores of the considered (quotient-based) methods using three different ranking 
criteria with LDA. 


Ranking criterion (SVM)   Sample size     MIQ     FCQ      RQ      VQ      CQ

Relative                  n = 30         6.02    2.85    6.58    6.28    4.42
                          n = 50         5.99    2.72    6.70    6.15    4.73
                          n = 100        6.14    2.61    6.43    5.99    4.66
                          n = 200        6.42    2.30    6.29    5.75    4.68

Positional                n = 30          .26    7.20     .74     .46    7.34
                          n = 50          .16    7.19     .80     .43    7.42
                          n = 100         .26    7.31     .66     .32    7.48
                          n = 200         .28    7.36     .61     .22    7.56

F1                        n = 30        17.90   13.78   17.99   17.17   13.16
                          n = 50        17.58   13.51   18.21   17.28   13.42
                          n = 100       17.97   13.86   17.71   16.95   13.59
                          n = 200       17.89   13.85   17.70   16.58   14.06


Table 8a.- Global scores of the considered (quotient-based) methods using three different ranking 
criteria with the linear SVM. 

NB outputs

Output                    Data          MIQ     FCQ      RQ      VQ      CQ    Base

Classification accuracy   Growth      87.10   88.17   86.02   86.02   87.10   84.95
                          Tecator     97.67   96.28   99.53   99.53   98.14   97.21
                          Phoneme     73.08   80.20   80.38   80.15   80.32   74.08

Number of variables       Growth        1.5     1.1     1.1     1.1     1.1      31
                          Tecator       4.8     5.0     1.0     1.0     4.4     100
                          Phoneme      14.5    10.6    16.7    16.8    14.1     256


k-NN outputs

Output                    Data          MIQ     FCQ      RQ      VQ      CQ    Base

Classification accuracy   Growth      95.70   83.87   83.87   83.87   83.87   96.77
                          Tecator     96.74   99.07   99.53   99.53   98.60   98.60
                          Phoneme     75.53   81.42   79.79   80.38   80.61   78.80

Number of variables       Growth        3.9     1.0     1.0     1.0     1.0      31
                          Tecator       4.0     3.0     1.0     1.0     4.3     100
                          Phoneme      18.4    12.1    12.3    15.2     6.7     256


LDA outputs

Output                    Data          MIQ     FCQ      RQ      VQ      CQ    Base

Classification accuracy   Growth      95.70   91.40   91.40   91.40   91.40       -
                          Tecator     94.88   94.42   94.88   94.42   95.35       -
                          Phoneme     74.55   78.88   79.10   79.63   80.26       -

Number of variables       Growth        3.7     5.0     4.9     4.9     5.0       -
                          Tecator       6.1     8.4     4.1     2.2     3.1       -
                          Phoneme      19.0     8.9    10.0     9.0     9.2       -



SVM outputs

Output                    Data          MIQ     FCQ      RQ      VQ      CQ    Base

Classification accuracy   Growth      94.62   87.10   87.10   87.10   86.02   95.70
                          Tecator     98.14   99.07   99.53   99.07   98.60   99.07
                          Phoneme     75.30   80.71   80.67   80.37   80.33   80.96

Number of variables       Growth        3.5     5.0     4.9     4.9     5.0      31
                          Tecator       6.7     2.1     1.0     1.0     4.1     100
                          Phoneme      19.3    10.1    11.3    10.8    12.2     256


Table 9a.- Performances of the different (quotient-based) mRMR methods in three real data sets. 
From top to bottom, the tables show the Naive Bayes, k-NN, LDA and linear SVM outputs, respectively. 



[Figure 2a: image not reproduced. Horizontal axis: Simulation experiments (ticks from 50 to 400).]

Figure 2a.- Chromatic version of the global relative ranking table, taking into account the 400 
considered experiments (columns) and the Naive Bayes classifier. 



[Figure 2b: image not reproduced. Horizontal axis: Simulation experiments (ticks from 50 to 400).]

Figure 2b.- Chromatic version of the global relative ranking table, taking into account the 400 
considered experiments (columns) and the Linear Discriminant Analysis classifier. 

Figure 2c.- Chromatic version of the global relative ranking table, taking into account the 400 
considered experiments (columns) and the linear SVM. 